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Abstract 

Background: Many computational programs have been developed to identify enriched regions for a single biological 
ChlP-seq sample. Given that many biological questions are often asked to compare the difference between two 
different conditions, it is important to develop new programs that address the comparison of two biological ChlP-seq 
samples. Despite several programs designed to address this question, these programs suffer from some drawbacks, 
such as inability to distinguish whether the identified differential enriched regions are indeed significantly enriched, lack 
of distinguishing binding patterns, and neglect of the normalization between samples. 

Results: In this study, we developed a novel quantitative method for comparing two biological ChlP-seq samples, 
called QChlPat. Our method employs a new global normalization method: nonparametric empirical Bayes (NEB) 
correction normalization, utilizes pre-defined enriched regions identified from single-sample peak calling programs, 
uses statistical methods to define differential enriched regions, then defines binding (histone modification) pattern 
information for those differential enriched regions. Our program was tested on a benchmark data: histone 
modifications data used by ChlPDiffs. It was then applied on two study cases: one to identify differential histone 
modification sites for ChlP-seq of H3K27me3 and H3K9me2 data in AKTI-transfected MCF10A cells; the other to 
identify differential binding sites for ChlP-seq of TCF7L2 data in MCF7 and PANC1 cells. 

Conclusions: Several advantages of our program include: 1) it considers a control (or input) experiment; 2) it 
incorporates a novel global normalization strategy: nonparametric empirical Bayes correction normalization; 3) it 
provides the binding pattern information among different enriched regions. QChlPat is implemented in R, Perl and 
C++, and has been tested under Linux. The R package is available at http://motif.bmi.ohio-state.edu/QChlPat. 



Background 

ChlP-seq (chromatin immunoprecipitation (ChIP) cou- 
pling with DNA sequencing) is widely used to precisely 
map the location of transcription factor (TF) binding or 
histone modification sites at a genome-wide scale [1-5]. 
Other related sequencing-based techniques DNase-seq [6], 
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FAIRE-seq [6] are often used to define the open chromatin 
regions and identify accessible regulatory regions. Many 
computational programs [4,7-9] have been developed to 
identify enriched regions (referring to either binding sites 
or histone modification sites) for the ChlP-seq data. How- 
ever, a majority of these programs were developed for a 
single ChlP-seq sample data. Given that many biological 
questions are often asked to compare the enriched regions 
under two different conditions, such as before and after 
using drugs, with and without chemical or hormone treat- 
ment, binding patterns of different transcription factors, or 
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one transcription factor binding information in two cell 
types, it is critical to develop new programs to meet this 
need and to quantitatively compare two biological ChlP- 
seq samples. In our recent study [10], we found following 
TGFp stimulation of the A2780 epithelial ovarian cancer 
cell line, ChlP-seq identified SMAD4 binding loci were 
classified into four distinct binding patterns: 1) Basal; 2) 
Shifted; 3) Stimulated Only; 4) Unstimulated Only, indi- 
cating that TGFp stimulation alters SMAD4 binding pat- 
terns in epithelial ovarian cancer cells. However, the 
binding patterns were measured on a qualitative basis. 
Therefore, it is essential to develop a quantitative method 
to address this increasingly generalized biological question. 

A study from Xu et al [11] employed a Hidden Markov 
Model (HMM) to identify the differential binding enrich- 
ments from two ChlP-seq samples (ChlPDiff), which was 
tested on three histone modifications data in two different 
cell types. However, the ChlPDiff program suffers from 
some shortcomings. Firstly, the HMM-based algorithm 
itself cannot distinguish whether those identified differen- 
tial enriched regions are indeed significantly enriched or 
not. This is because in many cases a more enriched region 
in sample A comparing to sample B isn't necessarily an 
enriched region in sample A itself and vice versa. There- 
fore, the program tends to detect many false positive mod- 
ification sites. Secondly, this program directly uses the 
samples' sequencing reads to perform the comparison 
without considering the information in the control experi- 
ment of each sample. However, as noted by Zhang et al. 
[9], the control sample is critical for modeling the sequen- 
cing and mapping biases. Thirdly, it is questionable 
whether this program is suitable for narrow peaks compar- 
ison since the program sets the bin size at least 1 kb. For 
example, the peak width for a typical TF binding site is 
around 300-500 bp. Fourthly, it fails to define a binding 
pattern for each identified differential binding site after 
comparing two samples. A few other studies have made 
the same efforts to develop programs, DBChIP [12], DIME 
[13], to detect differential binding sites in two samples, but 
these programs suffer from the same drawbacks either 
lacking distinguishing binding patterns or neglecting the 
normalization between samples. 

In this study, we developed a novel quantitative method 
for comparing two biological ChlP-seq samples, called 
QChlPat, not only to detect the differential binding sites 
but also to classify them into distinct binding patterns. 
Our method employs a new global normalization method 
(nonparametric empirical Bayes correction normalization) 
[14-17], utilizes pre-defined enriched regions identified 
from single-sample enriched regions identification pro- 
grams, uses statistical methods to define differential 
enriched regions, then defines binding pattern information 
for those differential enriched regions. Our program was 
tested on a benchmark data, histone modifications data 



[18] used by ChlPDiffs [11]. It was then applied on two 
study cases: one to identify differential histone modifica- 
tion sites for ChlP-seq of H3K27me3 and H3K9me2 data 
in AKTl-transfected MCF10A cells; the other to identify 
differential binding sites for ChlP-seq of TCF7L2 data in 
MCF7 and PANC1 cells. 

Results 

Evaluation of QChlPat with benchmark data 

A benchmark testing data, the ChlP-seq of histone mod- 
ification data from Mikkelsen et al [18], including 
H3K27me3 (K27), H3K4me3 (K4) and H3K36me3 
(K36), was used for evaluating our program. The reason 
for choosing this data for the evaluation is the following: 
1) it was used by ChlPDiff [11] and is easily compared 
to our program; and 2) differential histone modification 
sites have been validated in the same study [18]. A sum- 
mary of the results for comparing different methods are 
shown in Table 1. We found that compared with ChlP- 
Diff, QChlPat identified fewer differential histone modifi- 
cation sites (DHMSs). Two possible reasons may account 
for this: 1) the peak calling program BELT used in QChl- 
Pat identified a more stringent set of initial enriched 
regions compared with the one in ChlPDiff; and 2) the 
Wilcoxon rank test used in QChlPat removed less signifi- 
candy different enriched regions. In order to minimize the 
bias from peak calling programs, other programs MACS 
[9] and FindPeaks [7] were also used for a comparison 
(Additional file 1 - Supplemental Table SI). Both pro- 
grams identified even fewer DHMSs than BELT did. In 
addition, BELT was able to detect more DHMSs with 
"Only" patterns and "Shift" patterns compared with 
MACS and FindPeaks. 

Comparison of different normalization methods 

ChlP-seq samples performed in different conditions and 
time usually have different numbers of raw sequencing 
reads due to the number of sequencing lanes conducted 
for each sample (sequencing depth) and multiple-matched 
or non-matched matched reads in the sample. Before 
comparing two samples, it is necessary to normalize the 
read counts to eliminate system errors and minimize 
sequencing depth difference between two samples. In our 
program, we employed three normalization methods 
including linear normalization, nonparametric empirical 
Bayes correction normalization and quantile normaliza- 
tion, which are based on different assumptions of the data 
distribution. A comparative summary of the three different 
normalization methods is shown in Table 1. The results of 
the method without using any normalization methods 
were also reported. The results of ChlPDiff were used as a 
baseline to gauge the performance of each different nor- 
malization method. For each method, the percentage of 
DHMSs overlapping with ChlPDiff results is shown in 
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Table 1 A summary of comparing for the benchmark data with a bin size of 30 bp. 



Methods H3K4me3 H3K36me3 H3K27me3 





ESC enriched 


NPC enriched 


ESC enriched 


NPC enriched 


ESC enriched 


NPC enriched 


Linear normalization 


5738 (87.9%)* 


913 (61.4%) 


779 (29.9%) 


1403 (58.7%) 


3483 (77.2%) 


495 (50.9%) 


NEB normalization 


6912 (78.2%) 


621 (71.3%) 


951(25.6%) 


1080 (70.8%) 


3664 (72.3%) 


428 (55.1%) 


Quantile normalization 


3918 (98.3%) 


2611 (31.3%) 


526 (43.3%) 


1952 (35.9%) 


2875 ( 85.8%) 


793 (38.8%) 


Non-normalization 


6959 (78.2%) 


680 (70.3%) 


741 (31.4%) 


1409 (58.5%) 


3266 (79.7%) 


519 (49.3%) 


ChlPDiff 


12975 


1767 


1157 


1227 


3832 


888 



•Values in parentheses are percentage of DHMSs overlapping with ChlPDiff. 



parentheses. The results showed that the DHMSs identi- 
fied by quantile and NEB normalization methods highly 
overlap with those predicted by ChlPDiff, while the linear 
normalization method and non-normalization method 
have fewer overlapping DHMSs with ChlPDiff. For all the 
three data, the overlapping percentage between quantile 
and ChlPDiff is the highest one among all the three nor- 
malization methods in ESC, while NEB method achieves 
the highest overlapping percentage in NPC. Furthermore, 
the NEB method identifies more DHMSs than the quantile 
method in ESC and a similar number of DHMSs as that of 
the quantile method in NPC. Therefore, the overall perfor- 
mance of the NEB is better than that of other normaliza- 
tion methods on the three data. The results are not 
surprising since NEB normalization has two advantages 
compared with other normalization methods. Firstly, NEB 
estimates the proportion of the reads as p^n^lS, where «! 
is the frequency of the read with count 1 and S is the 
sequencing depth of the ChlP-Seq data. Secondly, NEB 
adjusts read counts based on both the observed read 
counts and the nature of the frequency distribution of the 
read counts. 

Optimization of the NEB-based method 

In order to further optimize the NEB-based method, we 
tested the bin size parameter used in Wilcoxon rank 
test in our QChlPat program, to determine how this 
parameter would affect the number of the enriched 
regions and the types of binding patterns identified by 
our program. To optimize bin size and determine how 
its variance would affect the results, we first tested a 
range of different bin sizes from 5 to 60 bp for each bin, 
in which each enriched region was divided into equal 
length bins (see Methods). We found that the number 
of DHMSs for each sample didn't significantly change 
with different bin size tested (Figure 1A) when the NEB 
normalization method was used. However, when the bin 
size is higher than 10 bp, the number of identified 
DHMSs is very similar, suggesting that the optimal bin 
size for the NEB method may be between 20 and 40 bp. 
The binding pattern results show the same trend. The 
number of differential binding patterns remained stable 
when the bin size is higher than 10 bp (Figure IB). 



Application to K9 and K27 datasets 

To further test the performance and efficiency of QChl- 
Pat, we applied it to a study case, ChlP-seq of H3K27me3 
and H3K9me2 in AKTl-tranfected MCF10A vs vehicle 
control, conducted in our laboratory. Many studies have 
shown that among many types of histone modifications, 
H3K27me3 and H3K9me2, two repressive marks are cri- 
tical for normal and aberrant differentiation of stem and 
progenitor cells [19] as well as in the development of 
cancers [20]. Our previous study also demonstrated that 
epigenetic silencing of a set of AKTl-mediated genes 
[21]. AKT1 kinase is a key downstream effector of the 
phosphoinositide 3-kinase (PI3K) signaling pathway that 
regulates diverse cellular functions, including growth, 
proliferation, survival, metabolism, motility, angiogenesis, 
and vesicle trafficking [22,23] . Therefore, it is important 
to determine AKTl-mediated genome-wide histone 
modification patterns for H3K27me3 and H3K9me2 and 
its subsequent influence on downstream target genes. We 
chose the NEB normalization method since it performed 
best on the benchmark data H3K27me3, H3K4me3 and 
H3K36me3 compared to other three methods (see section 
Comparison of different normalization methods). 

We have identified a total of 10,067 DHMSs for 
H3K27me3 in AKTl-transfected MCF10A cells. Of these, 
a majority of them (80.7%) were classified into "Only" or 
"Shift" binding patterns for AKTl-tranfected MCF10A 
cells. For H3K9me2 mark, of 2,532 DHMSs in AKTl- 
transfected MCF10A cells, 87.9% were either "Only" or 
"Shift" binding patterns (Table 2). The detailed informa- 
tion of the DHMSs is in Additional file 2, 3. Our results 
suggests that AKT1 signalling may trigger the switching 
of H3K27me3 or H3K9me2 modification sites, i.e. modi- 
fying different sets of nucleosomes on the genome, result- 
ing in marking a totally different set of target genes in the 
case of "Only" binding. We further correlated the genes 
with DHMSs with a set of 2,893 AKTl-mediated differen- 
tially expressed genes in MCF10A cells, found that 661 
genes have at least one DHMS for H3K27me3 or 
H3K9me2 marks (Figure 2A). Interestingly, we found that 
the DHMSs for H3K27me3 generally showed higher 
enrichment than those of H3K9me2. However, there was 
no significant difference in DHMSs' enrichments for 
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Figure 1 The optimization of NEB-based method. (A) The total number of DHMSs detected by the NEB normalization method with varying 
bin sizes. (B) The number of binding patterns identified with varying bin sizes. 



AKTl-mediated up- or down-regulated genes. We further 
performed GO analysis on the sets of genes enriched for 
both H3K27me3 and H3K9me2 marks (Figure 2B and 
2C). We defined two kinds of enriched genes: positive and 
negative enriched genes. Positive enriched genes are genes 
that are more enriched with either of two histone marks 



Table 2 The binding patterns identified for the K9 and 
K27 datasets. 



Sample (MCF10A) 


Num. of DHMSs 


Only 


Shift 


H3K27me3 (AKT1 -transfected) 


1 0,067 


6,799 


1333 


H3K27me3 (vehicle control) 


1,126 


463 


438 


H3K9m2 (AKT1 -transfected) 


2,532 


1346 


879 


H3K9m2 (vehicle control) 


6,508 


4507 


1155 



compared to the vehicle control (Figure 2B, common posi- 
tive enriched genes between two datasets K9 and K27); 
while negative enriched genes less enriched with neither 
marks than the vehicle control (Figure 2C, common nega- 
tive enriched genes between two datasets K9 and K27). 
Interestingly, the positive genes that are enriched in both 
the histone marks are highly enriched in the categories of 
tissue specific cellular functions, such as endocrine system 
disorders, gastrointestinal disease (Figure 2B). In contrast, 
the common negative enriched genes which lack these 
two histone marks are more enriched in cancer related 
genes (Figure 2C). The function of both groups of genes 
are epigenetically silenced by the presence of these histone 
modifications, while cancer related genes might be acti- 
vated due to the loss of the epigenetic effect, which is 
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Figure 2 Application of QChlPat on K27 and K9 datasets. (A) Heatmap showing AKT1 -mediated differential genes with enriched K27 and K9 
marks. (B) GO analysis on common genes with both K9 and K27 marks showing high enrichment in the categories of tissue specific cellular 
functions. (C) GO analysis on common genes that lack either of two histone marks showing enriched in cancer related genes. 



consistent with the stem cell-like characteristics of tumour 
cells [24-26]. Next, we wanted to examine if there is func- 
tional difference between these two sets of genes that are 
marked by the two different histone modifications. Not 
surprisingly, genes which are positive with either mark in 
the AKTl-transfected cells show more tissue specificity. 
However, genes lacking H3K27me3 in AKTl-transfected 
cells are realted to cell death whereas the genes without 
H3K9me2 are more enriched in cancer genes. 

Application to TCF7L2 in MCF7 and PANC1 cells 

In order to evaluate the performance of QChlPat on a par- 
ticular transcription factor, we further tested QChlPat on 
ChlP-seq of the TCF7L2 in MCF7 and PANC1 cells. 
Using a set of parameters in which a bin size was 30 bp, 
p-value was less than 0.05, ratio was 1.2, difference was 
0.8, BELT threshold was 0.91 for TCF7L2 in MCF7, and 
BELT threshold was 0.92 for TCF7L2 in PANC1, QChlPat 
identified 1,838 "MCF7 Only," 89 "MCF7 Shift," 27,163 
"MCF7 Unchanged," 1,341 "PANC1 Only," 91 "PANC1 
Shift," and 24,882 "PANC1 Unchanged" binding sites. 
Screenshots of examples for each type of binding pattern 
are shown in Figure 3A. We then examined the accuracy 



by correlating those "Only" binding pattern with their 
associated genes' expression levels. As expected, we found 
that genes associated TCF7L2 "MCF7 Only" have higher 
expression values in MCF7 cells than in PANC1 cells 
while genes associated TCF7L2 "PANC1 Only" have 
higher expression values in PANC1 cells than in MCF7 
cells (Figure 3B). This result strongly demonstrated the 
accuracy of QChlPat. 

We next performed IPA analysis on genes associated 
with MCF7 and PANC1 "Only" TCF7L2 binding sites 
(Figure 3C). Interestingly, we found that the genes asso- 
ciated with "MCF7 Only" binding sites differed signifi- 
cantly from the genes associated with "PANC1 Only" 
binding sites in terms of the three IPA categories of net- 
work functions, physiological system development and 
function, and canonical pathways. For all three gene 
expression categories, only cancer and post-translational 
modification (network functions) showed up for both 
cases of genes. This functional difference between genes 
associated with "MCF7 Only" TCF7L2 binding sites com- 
pared to genes associated with "PANC1 Only" TCF7L2 
binding sites is expected, because the genes are differen- 
tially regulated by definition of the "Only" binding 
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Figure 3 Application of QChlPat to TCF7L2 datasets. (A) Visualizations of read counts per bin and genomic locations of various different 
TCF7L2 binding patterns in MCF7 and PANC1. (B) Four boxplots of log 2 (FPKM) gene expression values. Before taking the log 2 of the FPKM 
values, FPKMs equal to zero were removed and FPKMs between 0 and 1 were changed to 1. I) The boxplot of the MCF7 log2(FPKM) gene 
expression values for the genes associated with "MCF7 Only" TCF7L2 binding sites. II) The boxplot of the PANC1 log 2 (FPKM) gene expression 
values for the genes associated with "MCF7 Only" TCF7L2 binding sites. Ill) The boxplot of the PANC1 log 2 (FPKM) gene expression values for the 
genes associated with "PANC1 Only" TCF7L2 binding sites. IV) The boxplot of the MCF7 log 2 (FPKM) gene expression values for the genes 
associated with "PANG Only" TCF7L2 binding sites. (C) IPA analysis of genes associated with TCF7L2 "MCF7 Only" and "PANC1 Only" binding 
sites respectively. 



pattern: genes associated with "Only" differential 
enriched regions (TCF7L2 binding sites in this case) in 
one sample are not associated with any differential 
enriched regions in the other sample, Therefore, the sig- 
nificant difference in function of the two groups of genes 
supports the accuracy of QChlPat's "Only" pattern 
detection. 

Discussion and conclusions 

In this study, we developed a quantitative method, QChl- 
Pat, to identify distinct binding patterns for two biologi- 
cal ChlP-seq samples, and the program is implemented 
in R, Perl and C++, and run in Linux system. Using a 
benchmarking test data, histone modifications data [18], 
we compared the performance of three different normali- 
zation methods on identification of DHMSs, including 
linear normalization, NEB correction normalization and 
quantile normalization, then compared it with ChlPDiffs 
program [11]. Although each normalization method has 
its own advantages, we found that NEB correction nor- 
malization is a more suitable normalization method in 
our QChlPat. Two notable advantages of NEB correction 



normalization are: 1) NEB estimates the proportion of 
the reads using the frequency of the read against the 
sequencing depth, which avoids both underestimating 
the proportion of highly abundant reads and overestimat- 
ing the proportion of low and intermediate abundance 
reads, and 2) NEB adjusts read counts based on both the 
observed read counts and the nature of the frequency 
distribution of the read counts. This is able to correct 
abundant reads among high and low regions. Therefore, 
the enriched regions can be correctly identified. In addi- 
tion, QChlPat has implemented a novel feature: categor- 
izing binding (or modification) sites into three distinct 
binding patterns. This novel feature implementation is 
very important since quantitatively associating each of 
the differential binding sites with one of two samples is 
essential to reveal key biological functions. For example, 
Once we identify "Only" binding of genes in Sample A 
and Sample B, we are able to perform GO or other path- 
way analyses on these genes and may identify some inter- 
esting biological functions. This was demonstrated in our 
two study cases (Figures 2 and 3). Thus, compared to 
other similar programs, ChlPDiffs [11], DBChIP [12], and 
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DIME [13], QChlPat is able to not only identify differen- 
tial enriched binding regions between two samples but 
also to further classify these regions into distinct binding 
patterns associated with each sample. 

Our QChlPat was further tested and validated on two 
study cases, ChlP-seq of H3K27me3 and H3K9me2 in 
AKTl-tranfected MCF10A vs. vehicle control, and ChlP- 
seq of TCF7L2 in MCF7 vs. PANC1 cells, both from our 
previous studies. Interestingly, both H3K37me3 and 
H3K9me2 marks showed two major binding patterns 
"Only" or "Shift" binding at AKTl-tranfected MCF10A 
cells (Table 2). Although our previous study has shown 
[21] AKT signaling can be a trigger of the epigenetic silen- 
cing at many downstream target genes through the cross- 
talk between DNA methylation and H3K27me3, it didn't 
show the relationship between H3K37me3 and H3K9me2. 
Nevertheless, it is a little bit surprising that both AKT1- 
mediated up- and down-regulated genes have at least one 
differential binding site for H3K27me3 or H3K9me2 
marks but there is no significant difference in enrichments 
since both histone marks are recognized as repressive 
marks. It is possibly that activated AKT signaling may trig- 
ger downstream key transcription factors and further 
dynamically regulate epigenetic processes for the interplay 
of these two histone modifications. A more thorough 
underlying mechanism for this epigenetic process needs to 
be elucidated in an experimental study in the future. The 
test of QChlPat on ChlP-seq of TCF7L2 in MCF7 and 
PANC1 cells was used to evaluate its accuracy in detecting 
binding patterns of a particular transcription factor. 
The correlation between gene expression levels with 
the detected MCF7 and PANC1 "Only" TCF7L2 binding 
sites demonstrated that QChlPat is also applicable to com- 
pare narrow peaks in two ChlP-seq samples such as TFs 
in addition to broad peaks such as repressive histone 
modifications. 

Several notable advantages of QChlPat include: 1) this 
software is able to make use of the information in control 
experiment; 2) this software incorporates a novel global 
normalization strategy: nonparametric empirical Bayes 
correction normalization; and 3) this software provides the 
binding pattern information among different enriched 
regions. 

Nevertheless, there are several limitations. First, although 
three binding patterns were presented in this study, 
another binding pattern that may be interesting is where 
there is a sharp enriched region in one ChlP-seq sample, 
while a broad enriched region in the other ChlP-seq sam- 
ple, but the total number of mapped reads are approxi- 
mately the same. We look forward to deeper investigations 
on this binding pattern and will consider this pattern in 
the updated version of QChlPat in the future. Second, in 
this study, we only focus on comparing two biological 
ChlP-seq samples and identifying their differential binding 



patterns. Another interesting task is to identify differential 
binding patterns from two groups of ChlP-seq samples, 
which requires the comparison of more than two samples. 
We are studying new mathematical and statistical models 
for quantitatively identifying binding patterns from two 
groups of ChlP-seq samples in future work. 

Methods 

Program description 

The flowchart of the quantitative method QChlPat is 
shown in Figure 4A. Given two biological ChlP-seq sam- 
ples, the first step is to normalize the two sample data in 
order to eliminate system errors and minimize sequencing 
depth difference. Three different statistical normalization 
methods are incorporated into the program, including lin- 
ear normalization [27], nonparametric empirical Bayes 
correction normalization (NEB) [14] and quantile normali- 
zation [28]. The second step is to identify enriched regions 
by peak-calling programs. Three programs, including 
BELT1.0 [29] developed in our laboratory, MACS1.4.0beta 
[9] and FindPeaks4.0 [7] being widely used in the commu- 
nity, are currently employed in our QChlPat program. 
The third step is to distinguish the significantly differential 
enriched regions (DERs) between two ChlP-seq samples 
by using Wilcoxon rank test. The last step is to classify 
these DERs into three distinct binding patterns based on 
annotated genes information. Three distinct binding pat- 
terns are defined in our program: 1) "Unchanged" binding; 
2) "Shift" binding; and 3) "Only" binding (Figure 4B). The 
program is implemented in R. BELT1.0 (the default peak 
calling program in the R package) is a program previously 
developed by our group. The parameter precision is used 
to specify threshold percentage number, followed by dou- 
bles [0.5, 1), indicating the threshold percentage number 
that will be used to calculate the threshold enrichment 
level. This parameter can be used to adjust FDR. Usage of 
this program requires installation of Java. A detailed 
description of the selection of optimal parameters for 
these programs is in Additional file 1. 

Normalization methods 

Three optional normalization methods are provided in 
QChlPat, including linear normalization, nonparametric 
empirical Bayes correction normalization and Quantile 
normalization, which are based on different assumptions 
of the data distribution. 

1) Linear normalization [27]: Linear normalization is the 
simplest and most straightforward way to normalize ChlP- 
seq data by scaling the total number of reads of different 
ChlP-seq samples into the same level. This normalization 
method is reasonable if the total number of reads in the 
two samples is roughly the same. In our program, linear 
normalization is used, in which all reads in a sample is 
divided by the total number of reads in the sample so that 
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Figure 4 Flowchart of QChlPat and an example. (A) The flowchart of QChlPat depicts the steps to identify distinct binding patterns for two 
ChlP-seq samples. (B) This figure shows the comparison between two ChlP-seq samples: ESC H3K36me3 and NPC H3K36me3. The top subfigure 
displays two ESC "Only" H3K36me3 peaks compared zero NPC H3K36me3 peaks. The bottom subfigure shows an NPC "Shift/Unchanged" 
H36Kme3 peak compared with two ESC H3K36me3 peaks: one "Shift" (left) and the other "Unchanged" (right). The NPC H3K36me3 peak is 
labelled "Shift/Unchanged" because it corresponds both as a "Shift" peak in comparison with the left-hand ESC H3K36me3 peak and as an 
"Unchanged' peak in comparison with the right-hand ESC H3K36me3 peak. 



the reads in two samples are in the same scale. This can be 
formulated as p = x/n, where x is the number of reads in 
each bin (in order to calculate the distribution of the data, 
the whole genome is divided into small fixed size regions 
which are called bins in this study, and then the total 
number of reads in each bin is counted), proportion p is 
the normalized reads number in a given bin and n is the 
total number of reads of the ChlP-Seq data. 

2) Nonparametric empirical Bayes correction normali- 
zation (NEB) [14]: The drawback of the linear normali- 
zation is that the proportion p of the missing reads 
(because the sequencing depth is not enough, some 
reads would be missing) from the data is assigned to be 
zero, which underestimates the p of highly abundant 
reads and overestimates the p of low and intermediate 
abundance reads. Therefore, the nonparametric empiri- 
cal Bayes correction normalization (NEB) is used to 
solve this problem. The nonparametric empirical Bayes 
correction normalization has been successfully applied 
to normalize the different SAGE libraries with different 
sequencing depth [9]. In this study, the Good-Turing 
estimator (SGT) [30] is used to implement the empirical 
Bayes. Given a sample with total number of reads of S, 
the aim of empirical Bayes estimation is to estimate the 
true proportion of read i (Pi) from the data. The 
observed read count c is estimated as: 



cn c +i + n c+ i 



n, 



(1) 



Where n c is the number of reads with count c. The 
proportion of undetected reads Pq is estimated as: 



Po = 



(2) 



Where S is the sequencing depth (S = « 1 +« 2 +--')- The 
empirical Bayes estimator for proportion of a read with 
count c (P c ~) is renormalized as: 



Pl = 



(3) 



Where S* is the corrected total read count after SGT: 



(4) 



3) Quantile normalization [28]: If the distribution of the 
two samples is assumed to be the same, quantile normali- 
zation can be applied to the ChlP-seq data for further 
comparison. In this package, the read number distribution 
of the first sample is used as the reference distribution and 
the second sample is transformed to make sure that the 
distribution of the two samples is the same. The transfor- 
mation can be formulated as x norm 2 = F [F2(x)).F i is 
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the distribution of the first sample and F 2 is the distribu- 
tion of the second sample. Since the Wilcoxon rank test 
only makes use of the rank of the read number, choosing 
either the first or second sample does not affect the final 
comparison results. 

Comparison of enriched regions 

In order to reduce the number of false positive enriched 
regions and make use of the samples obtained by control 
experiments, three optional publicly available enriched 
region identification programs are included in our 
method, including BELT [29], MACS [9] and FindPeaks 
[7]. The users are able to select one based on their data- 
sets and interests. After identifying the enriched regions 
in each sample, each enriched region is divided into 
equal length bins. The number of bins in each enriched 
region is a parameter, which needs to be optimized based 
on the experiments. If the length of the enriched region 
is less than the bin number, the bin number is automati- 
cally set to the length of the enriched region. Since the 
length of enriched regions may be different from each 
other, the length of bin may vary for different peaks. 
After the normalized read number is counted, the num- 
ber of reads in each bin in the corresponding location in 
the comparing sample is also calculated, and then the 
two groups of reads number can be used for Wilcoxon 
rank test. 

Wilcoxon rank test 

There are two options of Wilcoxon rank tests: Wilcoxon 
signed rank test and Wilcoxon rank sum test. If it is 
believed that the read number of the two ChlP-Seq data 
is paired, Wilcoxon signed rank test should be used (wil. 
paired=TRUE). Let Z, = X t - Y, for i = 1, ... b n , and 
assume that the Z ; is independent; each Z ; comes from 
a same continuous population and is symmetric about 
its median 9. If not paired, Wilcoxon rank sum test 
should be used (wil.pair=FALSE). Let F denotes the dis- 
tribution of X, the read enrichment of Sample A, and G 
denotes the distribution of Y, the read enrichment of 
Sample B. Assume that F and G have the same shape 
and they differ only by median: G{t)=V{t-0) [31]. 

H 0 : 9=0. (The read enrichment difference between 
Sample A and B is not significant); 

H x : 9>0 (Sample A has more significant read enrichment); 

H 2 : 6<0 (Sample B has more significant read 
enrichment). 

For signed rank test, let R t represents the rank of each 
ordered |Z,| and (j>i denote whether Z ; is positive. When 
Z, is positive, <p t is 1. Otherwise, it is -1. The Wilcoxon 

signed rank statistic T+ is calculated as T + = X~ ybn r^. 

For the rank sum test, all x and Y are put together and 
the pooled data is ranked. Let Sj denotes the rank of Y s , 



and then the Wilcoxon rank sum statistics is the sum of 

Ebn 
i Sj). Then, by checking the correspond- 
ing table, p-value can be calculated. 

Determination of differential enriched regions 

Once a p-value from Wilcoxon rank test and read ratio 

Ebn ^— ybn 
1 x i il / i X 2j' x t.i stands for the reads num- 
ber in each bin of sample A and x 2 j stands for the reads 
number in each bin of sample B) and difference 



(d 



Ebn 



bn 



7=i 



2 j 



have been calculated, a set 



of significant differential enriched regions in each sam- 
ple are determined based on the user defined cutoffs. 

Definition of distinct binding patterns 

Each differential enriched region, identified from compari- 
son of a sample A with a sample B, is classified into one of 
the following three distinct binding patterns: 1) Sample 
A/B "Unchanged" - one of two corresponding differential 
enriched regions in separate samples that are associated 
with the same annotated gene and are within 1 kb distance 
of each other; 2) Sample A/B "Shift"- one of two corre- 
sponding differential enriched regions in separate samples 
that are associated with the same gene, but are more than 
1 kb apart from one another; and 3) Sample A/B "Only" - 
a differential enriched region associated with a gene in one 
sample, but no differential enriched regions are associated 
with that gene in the other sample. Note that the program, 
by default, associates a gene with a differential enriched 
region if they are within 100 kb of each other. 

Benchmark data 

The ChlP-seq of histone modification data from Mikkel- 
sen et al [18] (the raw data can be downloaded at ftp:// 
ftp.broad.mit.edu/pub/papers/chipseq/Mikkelsen2007/ 
alignments/) including H3K27me3 (K27), H3K4me3 
(K4) and H3K36me3 (K36) were used for the bench- 
marking test to evaluate the performance of our pro- 
gram QChlPat. 

Data output 

QChlPat provides comprehensive output information 
including a summary report, differential enriched 
regions in each sample with BED format and UCSC web 
browser and Affymetrix Integrated Genome Browser 
compatible wiggle (.wig) for the visualization, as well as 
the assigned binding pattern of differential enriched 
regions. 

Data sets for two study cases 

ChlP-seq of H3K27me3 and H3K9me2 in AKTl-tranfected 
MCF710A cells is obtained from our previous study [21]. 
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The ChlP-seq of TCF7L2 in MCF7 and PANC1 cells were 
obtained from our previous study [32]. 

Additional material 



Additional file 1 
Additional file 2 

Additional file 3: Table SI. Number of DHMSs identified by different 
programs for identifying enriched regions. 
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